Negative Databases for Biometric Data 

Julien Bringer¹ and Hervé Chabanne¹,²

¹ Sagem Sécurité, France.
² Télécom ParisTech, France.

May 11, 2010

Abstract 

Negative databases - negative representations of a set of data - were
introduced in 2004 to protect the data they contain. Today, no solution is
known for building biometric negative databases. This is surprising, as
biometric applications are very demanding of such protection for privacy
reasons. The main difficulty comes from the fact that biometric captures of
the same trait give different results, and comparisons of the stored
reference with the freshly captured biometric data have to take this
variability into account. In this paper, we give a first answer to this
problem by exhibiting a way to create and exploit biometric negative
databases.

1 Introduction



Biometric data must be protected in order to prevent anyone from tracing back
the users of a biometric system. Recently, there has been a lot of research
on storing them in a way that is renewable and does not leak information.
Typically, biometric data are quantized and encrypted. This encryption must
still permit the matching of the underlying biometric data without decrypting
them. On the one hand, some very simple encryption techniques, known as
secure sketches, have been suggested (TUlIIl], but their resistance seems
doubtful in practice [31[3D]. On the other hand, secure sketches can be
combined with homomorphic encryption, but in this case the performance of
the computations is penalized [n[5l[2T].

In this paper, we take a different approach, following that of negative
databases [12]. In a negative database, instead of storing the elements of a
database $DB$, we consider the complement $\overline{DB}$ of these elements.
This means that instead of checking whether $b \in DB$, we equivalently
verify that $b \notin \overline{DB}$. The representation of negative
databases is made possible by a wild-card symbol * which stands for all
values; for instance, as we work here with binary vectors, a * for a bit
means either the value 0 or 1. Negative databases have very interesting
properties. Firstly, for a given database $DB$, different negative databases
$\overline{DB}$ can be established. Moreover, starting from $\overline{DB}$,
it is hard to retrieve $DB$. Finally, a relational algebra also exists for
negative databases. It should be noted that our approach differs from the
previous ones, which treat biometric data individually, whereas we consider
here a database as a whole.

The difficulty we encounter is that each capture of the same biometric trait
gives a different value. We consider here binarized biometric data, i.e. $b$
stands for a binary vector representing a biometric trait. A new capture of
this biometric trait will give a binary vector with numerous coordinates
differing from those of $b$. Our approach follows the one described in [16]
for identifying people by their iris. We show here how to handle this problem.

We begin in Section 2 by recalling some works on the binarization of
biometric data. We state the properties these binarized data have to fulfill
for the rest of our work. In Section 3, we give a short introduction to
negative databases. Section 4 constitutes the core of our proposal and
explains how to create and use biometric negative databases. In Section 5, we
give an example to gauge the efficiency of our solution when applied to a
real-life case. Finally, Section 6 concludes.

We conclude this introduction by recalling some basic facts about biometric
data. A reader familiar with this topic can skip it and go directly to
Section 2.

1.1 Biometric Systems: 101 

Biometric recognition techniques can lead to quite different applications
than those which are possible with, for instance, passwords. A major
difference comes from the fact that your biometric data make it possible to
identify you among a large set of people during your whole lifetime. This
characteristic is reinforced by the fact that biometric data are not
transferable to someone else. On the one hand, a positive aspect is that this
can be seen as a very natural and easy-to-deploy way of identifying
populations (think of the census of the citizens of a country). On the other
hand, biometric data must be protected to respect the privacy of their
owners. Today, AFIS (Automated Fingerprint Identification Systems) are
present in many countries worldwide for police or civil applications. These
AFIS can gather the biometric data of millions of users. Usually, to measure
their performance, we evaluate their accuracy in terms of FAR (False
Acceptance Rate: the probability that the system accepts an impostor) and FRR
(False Reject Rate: the probability that the system rejects a genuine user).
These two rates cannot both be reduced at the same time, and some compromise
must be found according to whether security or comfort is favoured. In this
paper, we are looking at biometric systems of smaller scale. Typically, we
are considering biometric readers mainly used for restricting access to a
building or a room. The need for storage of biometric data is then limited to
less than one hundred records.
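As a minimal sketch of the compromise between the two rates, the Python snippet below estimates them from lists of comparison distances at a given decision threshold. The function name `error_rates` and the distance values are hypothetical and chosen only for illustration; they do not come from any real system.

```python
def error_rates(genuine_distances, impostor_distances, threshold):
    """Return (FAR, FRR) when a comparison is accepted iff distance <= threshold."""
    # A false reject is a genuine comparison whose distance exceeds the threshold.
    false_rejects = sum(1 for d in genuine_distances if d > threshold)
    # A false accept is an impostor comparison that falls under the threshold.
    false_accepts = sum(1 for d in impostor_distances if d <= threshold)
    far = false_accepts / len(impostor_distances)
    frr = false_rejects / len(genuine_distances)
    return far, frr

# Hypothetical Hamming distances, invented for illustration only.
genuine = [310, 402, 455, 498, 530, 610]
impostor = [560, 640, 700, 730, 760, 810]

for threshold in (450, 550, 650):
    far, frr = error_rates(genuine, impostor, threshold)
    print(threshold, far, frr)
```

Raising the threshold can only lower the FRR and raise the FAR, which is the trade-off between comfort and security mentioned above.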



2 Binarization of Biometric Data 

Let $\beta$ designate the biometric trait of a user. Let $b \leftarrow \beta$
indicate that the value $b$ has been captured by a sensor.

We make here the hypothesis that the biometric captures $b$ are quantized and
can be represented as binary vectors of length $n$, $b \in \{0,1\}^n$, in
such a way that:

Condition 1 1. Two different captures $b, b'$ from the same user
$\mathcal{U}$ are with high probability at a Hamming distance
$d(b, b') < \lambda_{min}$.

2. Captures $b_1, b_2$ of different users $\mathcal{U}_1, \mathcal{U}_2$ are
at a Hamming distance $d(b_1, b_2) > \lambda_{max}$.

The origin of Condition 1 comes from iris recognition technology [5], where
the concept of iriscodes was introduced. Iriscodes are binary vectors of
length $n = 2048$ in which the features of the iris are represented. They are
compared by their Hamming distance. Following this, various attempts have
been made to apply this kind of representation to fingerprints ^J^ or faces
as in [B].

Condition 1 has been exploited in [16] for an efficient search algorithm over
a large database. With high probability, the vectors we are comparing will
have many small portions of their coordinates equal whenever they come from
the same user, as they are close for the Hamming distance. Let
$DB = \{b_1, \ldots, b_N\}$. [16] uses $h_1, \ldots, h_{128}$, projections of
binary vectors (here iriscodes) over a part of their coordinates. They
consider that a freshly captured iris $b$ can match an element stored in $DB$
whenever:

$$h_{j_1}(b) = h_{j_1}(b_k), \quad h_{j_2}(b) = h_{j_2}(b_k), \quad h_{j_3}(b) = h_{j_3}(b_k) \qquad (1)$$

for $1 \le j_1 < j_2 < j_3 \le 128$ and $1 \le k \le N$.
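The matching rule of Equation (1) can be sketched in Python as follows. This is only an illustration under stated assumptions: the coordinate subsets are drawn at random here (how [16] actually selects them is not described in this paper), and the names `make_projections`, `project` and `is_candidate` are hypothetical.

```python
import random

def make_projections(n, count, width, seed=0):
    """Draw `count` coordinate subsets of size `width` from range(n)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n), width)) for _ in range(count)]

def project(bits, coords):
    """Restrict a bit sequence to the given coordinates."""
    return tuple(bits[c] for c in coords)

def is_candidate(b, b_k, projections, order=3):
    """True if at least `order` projections take the same value on b and b_k."""
    agreeing = sum(
        1 for coords in projections
        if project(b, coords) == project(b_k, coords)
    )
    return agreeing >= order

n = 2048
projections = make_projections(n, count=128, width=10)
rng = random.Random(1)
enrolled = [rng.randint(0, 1) for _ in range(n)]
same = list(enrolled)                       # identical capture
inverted = [1 - bit for bit in enrolled]    # every bit differs
print(is_candidate(same, enrolled, projections))
print(is_candidate(inverted, enrolled, projections))
```

Requiring that three indices $j_1 < j_2 < j_3$ exist is the same as requiring that at least three of the 128 projections agree, which is what the counter checks.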

We skip the details here. This way of proceeding will serve as the basis of
our work on biometric negative databases, which is described in Section 4. We
recall a definition that formalizes this notion and which was introduced for
approximate nearest neighbor search:

Definition 1 (Locality-Sensitive Hashing, [17]) Let $(E, d)$ be the Hamming
space, $F$ be a set, $r_1, r_2 \in \mathbb{R}$ with $r_1 < r_2$,
$p_1, p_2 \in [0,1]$ with $p_1 > p_2$. Let $H = \{h_1, \ldots, h_L\}$ be a
family of functions $h_i : E \to F$.
The family $H$ is $(r_1, r_2, p_1, p_2)$-LSH if

$$\forall x, x' \in E: \quad \begin{cases} \Pr_{h \in H}[h(x) = h(x') \mid d(x, x') < r_1] \ge p_1 \\ \Pr_{h \in H}[h(x) = h(x') \mid d(x, x') > r_2] \le p_2 \end{cases}$$

In the sequel we will assume that the inequalities above are equalities, i.e.
the first (resp. second) probability is equal to $p_1$ (resp. $p_2$). This
holds for the hash constructions used by [16].

In practice, we apply Condition 1 to a family of Locality-Sensitive Hashing
functions that are combined to obtain the following result:



Proposition 1 Let the family $H = \{h_1, \ldots, h_L\}$ be a
$(\lambda_{min}, \lambda_{max}, p_1, p_2)$-LSH family. Let $m$ be the number
of hash functions that we consider simultaneously, following the principle of
Equation (1).

The probability not to output 'matching' for a genuine user - called False
Reject Rate (FRR) - is approximately

$$P_{fr} = \sum_{i=0}^{m-1} \binom{L}{i} p_1^i (1 - p_1)^{L-i}$$

and the probability to output 'matching' for an impostor - called False
Accept Rate (FAR) - is approximately

$$P_{fa} = \sum_{i=m}^{L} \binom{L}{i} p_2^i (1 - p_2)^{L-i}.$$

The variable $m$ is called order of the hash equalities in the sequel.

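Proof. Let $b_1, b_2$ be two different biometric captures (possibly coming
from the same user). They are considered as a matching pair as soon as there
are at least $m$ functions $h_{i_1}, \ldots, h_{i_m}$ from $H$ such that

$$h_{i_1}(b_1) = h_{i_1}(b_2), \ldots, h_{i_m}(b_1) = h_{i_m}(b_2).$$

As $H$ is $(\lambda_{min}, \lambda_{max}, p_1, p_2)$-LSH, from Condition 1 we
know that with high probability $\Pr_{h \in H}[h(b_1) = h(b_2)] = p_1$ if
$b_1, b_2$ come from the same user and $\Pr_{h \in H}[h(b_1) = h(b_2)] = p_2$
if they come from different users. Treating the $L$ hash equalities as
independent, the number of functions that agree follows a binomial
distribution, which gives the two sums above. □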
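The two sums of Proposition 1 are binomial tails and can be evaluated directly. The sketch below does so in Python for the parameter values of the example in Section 5 ($L = 128$, $m = 4$, 10-bit projections of 2048-bit iriscodes); the function names are hypothetical and the independence of the hash equalities is assumed, as in the approximation of the proposition.

```python
from math import comb

def false_reject_rate(L, m, p1):
    """P_fr: fewer than m of the L hash equalities hold for a genuine user."""
    return sum(comb(L, i) * p1**i * (1 - p1) ** (L - i) for i in range(m))

def false_accept_rate(L, m, p2):
    """P_fa: at least m of the L hash equalities hold for an impostor."""
    return sum(comb(L, i) * p2**i * (1 - p2) ** (L - i) for i in range(m, L + 1))

L, m = 128, 4
p1 = (1 - 512 / 2048) ** 10      # collision probability for a genuine pair
p2 = (1 - 716.8 / 2048) ** 10    # collision probability for an impostor pair
print(false_reject_rate(L, m, p1))
print(false_accept_rate(L, m, p2))
```

With these inputs the two values agree with the rates $P_{fr} \approx 0.066$ and $P_{fa} \approx 0.095$ quoted in Section 5.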

3 Negative Databases 

A negative database consists of the representation of the negative image of a
given database. The goal is to represent all the elements not in the original
database instead of explicitly storing the data itself. The main issue is to
find a way to represent the negative database concisely without making it
easy to retrieve the original records. Note that there exists an extended
notion of negative database where almost all elements not in the original
database are represented. In this section, we refer only to the first notion,
for which the definition - inspired by Esponda et al. [12] and subsequent
works - follows.

Definition 2 Let $DB$ be a database containing $N$ vectors. Let $l$ denote
the length of the binary vectors belonging to $DB$. A negative database
$\overline{DB}$ of $DB$ is a finite set of vectors such that

• there is an algorithm IsMember$_{\overline{DB}}$ that checks efficiently
whether an $l$-bit string is a member of $\overline{DB}$;

• $x \in \{0,1\}^l$ is an element represented by $\overline{DB}$ if and only
if $x \notin DB$, i.e. $\overline{DB}$ represents $\{0,1\}^l - DB$.



If $l$ is large, then the number of elements represented by $\overline{DB}$
could be huge. To achieve a compact representation, [M] introduced the
wild-card symbol '*', with the classical role: a position set to * in a
string represents both the 0 and 1 values. In this way, a negative database
is composed of vectors of length $l$ over the alphabet $\{0,1,*\}$. The
associated IsMember$_{\overline{DB}}$ algorithm corresponds to the string
matching algorithm: an element $x$ is a member if $\overline{DB}$ contains a
vector $y$ which agrees with $x$ on every position where both are binary
valued:

$$\forall i \in \{1, \ldots, l\}, \quad (x_i = y_i \ \text{OR} \ x_i = * \ \text{OR} \ y_i = *)$$

Several polynomial-time algorithms have been published for computing negative
databases [TJIHIIIS], with quite different techniques. We recall in Table 1
the algorithm introduced in [T^ to compute a negative representation of a
database by means of the prefix method. In the following, $w_i$ stands for a
prefix of length $i$ and $W_i$ represents a set of vectors of length $i$ (a
part of the $w_i$'s).



1. $i \leftarrow 0$
2. $W_i \leftarrow \{\}$
3. $W_{i+1} \leftarrow$ set of vectors of length $i + 1$ with a prefix of
   length $i$ in $W_i$ which is not a prefix of an element of $DB$
4. for all $x$ in $W_{i+1}$
5.     output a vector $y$ with $x$ as a prefix and complete it with *'s up
       to length $l$
6.     add $y$ to $\overline{DB}$
7. $i \leftarrow i + 1$
8. $W_i \leftarrow$ set of prefixes of length $i$ in $DB$
9. go back to step 3 while $i < l$

Table 1: Prefix algorithm for negative representation
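A possible Python reading of Table 1 is sketched below. It is an interpretation rather than the reference implementation of the cited work: it assumes that $W_0$ holds the empty prefix, so that the first iteration can cover leading bits absent from the database, and the names `prefix_negative_db` and `is_member` are hypothetical. The final loop checks exhaustively that the output matches exactly the strings outside the database.

```python
from itertools import product

def prefix_negative_db(db, l):
    """Return entries over {0,1,*} matching exactly the l-bit strings not in db."""
    ndb = []
    w = {""}  # assumption: W_0 holds the empty prefix
    for i in range(l):
        present = {x[: i + 1] for x in db}   # prefixes of length i+1 found in DB
        for prefix in sorted(w):
            for bit in "01":
                candidate = prefix + bit
                if candidate not in present:
                    # Prefix absent from DB: pad with wild-cards up to length l.
                    ndb.append(candidate + "*" * (l - i - 1))
        w = present                          # step 8 of Table 1
    return ndb

def is_member(ndb, x):
    """String matching test against the entries of the negative database."""
    return any(all(c in ("*", xc) for xc, c in zip(x, y)) for y in ndb)

db = {"000", "101", "111"}
ndb = prefix_negative_db(db, 3)
print(ndb)
for bits in product("01", repeat=3):
    x = "".join(bits)
    assert is_member(ndb, x) == (x not in db)
```

On this toy database the output has 4 entries, below the bound of $lN = 9$ entries.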

As proved in [14], the prefix algorithm outputs a negative database in $lN$
time (where $N$ is the number of records of $DB$) and the resulting negative
database will have at most $lN$ entries (elements of $\{0,1,*\}^l$). With
each symbol of $\{0,1,*\}$ encoded on 2 bits, this leads for $\overline{DB}$
to an overall size of $2 \times N \times l^2$ bits.

The prefix algorithm shows the feasibility of negative representation. As for
security concerns, [14] demonstrated that the reconstruction of $DB$ from
such a negative database over the alphabet $\{0,1,*\}$ is NP-hard, based on a
reduction from the 3-SAT problem. Concerning the way to construct hard
instances which resist SAT solvers, the known algorithms produce negative
representations with different levels of security, depending for example on
the number of *'s used. See [S1II3] for solutions on how to generate such
hard instances. Another interesting property is the capability to create
several different randomized negative representations from the same $DB$.

To obtain a randomized representation, [13] suggests a non-deterministic
version of the prefix algorithm based on:

• the use of a random permutation to include the wild-card symbol also at
the beginning or in the middle of a vector;

• a random replacement of wild-cards by both the values 0 and 1.

The resulting algorithm is designed to run in time polynomial in $l$ and $N$
and produces negative databases of size $2 \times N \times l^3$.

Given a negative database, it is even possible to transform it into a new
negative representation which is the negative image of the same database, but
such that the two negative databases are different and it is difficult to
determine whether they are equivalent. In [IT], this operation is denoted
Morph.

Together with the creation of a negative representation, specific operations
on $DB$ are translated into operations applied on $\overline{DB}$. Inserting
(resp. deleting) a string into (resp. from) $DB$ corresponds to removing
(resp. inserting) the corresponding binary strings from (resp. into)
$\overline{DB}$. As for the creation algorithm, randomized operations are
also available [12]. An alternative solution for creating a randomized
representation is then to start with a randomized negative representation of
the empty set and to delete from it the data corresponding to $DB$.

As remarked by [13], using these operations many times may cause the size of
$\overline{DB}$ to grow unreasonably. An algorithm that regularly cleans up
the representation is designed to control this effect (cf. the Clean-Up
algorithm of [13]).

Moreover, more complex operations are designed in [15] in the field of
relational algebra operators (equality, less-than, union, cartesian product,
intersection, ...). Each operator on $DB$ has its transposition as a negative
operator on $\overline{DB}$.

Note that several other techniques are analyzed in the literature. For
instance, [7] introduced a negative representation with a higher overall
growth factor but for which the security relies on cryptographic hash
properties.

4 Our Solution 

We now explain how to manage biometric data via negative representation in
order to protect the content of the enrollment database. The main motivation
is the authorization scenario; nonetheless, our construction can also be used
for authentication.

4.1 Biometric Negative Database 

We recall that our database is $DB = \{b_1, \ldots, b_N\}$. Let
$H = \{h_1, \ldots, h_L\}$ be a $(\lambda_{min}, \lambda_{max}, p_1, p_2)$-LSH
family of functions as defined in Definition 1, Section 2. Let $m$ be the
order of the hash equalities, i.e. the number of these functions we are
using, as in Proposition 1. We are now ready to define the biometric negative
database $\overline{DB}$.

For $1 \le k \le N$, let $DB_{b_k}$ be the database made of the elements

$$j_1 \| \ldots \| j_m \| h_{j_1}(b_k) \| \ldots \| h_{j_m}(b_k) \qquad (2)$$

with $1 \le j_1 < \ldots < j_m \le L$ and where $\|$ stands for the
concatenation.

Definition 3 An $(H, m, DB)$-biometric negative database $\overline{DB}$ is a
negative representation of the dataset
$\mathcal{U} = \bigcup_{k=1}^{N} DB_{b_k}$, which is made of all order $m$
hash chains obtained with respect to $DB$ and the LSH family $H$ (cf.
Equation (2)).


The algorithm to determine whether a fresh capture $b'$ is close to one
enrolled template within the original dataset $DB$, via one of its biometric
negative databases $\overline{DB}$, is given in Table 2. The capture is
accepted as soon as one of its hash chains is not a member of
$\overline{DB}$, i.e. belongs to $\mathcal{U}$.

Input: fresh biometric template $b'$

 1. result $\leftarrow$ NOK
 2. For all $1 \le j_1 < \ldots < j_m \le L$
 3.     $z \leftarrow j_1 \| \ldots \| j_m \| h_{j_1}(b') \| \ldots \| h_{j_m}(b')$
 4.     If IsMember$_{\overline{DB}}(z)$ = NOK
 5.     Then
 6.         result $\leftarrow$ OK
 7.         Break
 8.     Else Continue
 9. End For
10. Output result

Table 2: Authorization Check for Biometric Negative Database
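The authorization check can be sketched in Python on a toy instance. The sketch follows the reading of Proposition 2 in which a capture is accepted as soon as one of its hash chains is found in $\mathcal{U}$, i.e. is absent from the negative database. Everything here is hypothetical: the toy LSH family, the function names, and the negative database itself, which is simply the wild-card-free list of all 8-bit strings outside $\mathcal{U}$ rather than the output of a hard-instance generator.

```python
from itertools import combinations

PROJECTIONS = [[0, 1], [2, 3], [4, 5], [6, 7]]  # toy LSH family, L = 4
M, INDEX_BITS = 2, 2                            # order m and bits per index

def chain(b, js):
    """Hash chain of capture b for the index tuple js (0-based indices)."""
    indices = "".join(format(j, "0{}b".format(INDEX_BITS)) for j in js)
    hashes = "".join("".join(b[c] for c in PROJECTIONS[j]) for j in js)
    return indices + hashes

def all_chains(b):
    """Every order-M hash chain of capture b."""
    return {chain(b, js) for js in combinations(range(len(PROJECTIONS)), M)}

def build_negative_db(db):
    """Wild-card-free negative database: every chain-length string outside U."""
    u = set().union(*(all_chains(b_k) for b_k in db))
    length = M * (INDEX_BITS + len(PROJECTIONS[0]))
    universe = {format(v, "0{}b".format(length)) for v in range(2 ** length)}
    return universe - u

def authorize(ndb, fresh):
    """Return 'OK' as soon as one hash chain of `fresh` is absent from ndb."""
    for js in combinations(range(len(PROJECTIONS)), M):
        if chain(fresh, js) not in ndb:   # not in the negative DB, hence in U
            return "OK"
    return "NOK"

ndb = build_negative_db(["10110010", "01001101"])
print(authorize(ndb, "10110011"))  # noisy capture of the first template
print(authorize(ndb, "00100111"))  # capture far from both templates
```

The first capture differs from an enrolled template in one bit and still shares two projections with it, so one of its chains lies outside the negative database; the second shares no projection with either template.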

This construction enables us to achieve the following properties.

Proposition 2 Let $\overline{DB}$ be an $(H, m, DB)$-biometric negative
database.

1. $\overline{DB}$ is not a negative database of $DB$.

2. $\overline{DB}$ makes it possible to check whether a fresh capture $b'$ is
close to one element of $DB$ by verifying whether the derived hash chains
following Eq. (2) are not all in $\overline{DB}$. The error rates of this
membership checking operation are:

False 'Not Member' Decision

$$P_{fr} (1 - P_{fa})^{N-1}$$

False 'Member' Decision

$$1 - (1 - P_{fa})^{N}.$$
Proof.

1. The first statement comes from the fact that $\overline{DB}$ is a negative
database of $\mathcal{U} = \bigcup_{k=1}^{N} DB_{b_k}$, which is not
equivalent to $DB$, thanks to Proposition 1 (except in the case where
$p_1, p_2$ are negligible, which corresponds to the situation where the hash
functions are cryptographic ones with hard-to-find collisions; i.e. the case
where almost only original data satisfy the order $m$ hash equalities).

2. The second statement follows from Proposition 1 and Definition 2 together.
The algorithm IsMember$_{\overline{DB}}$ gives a way to decide whether data
are in $\mathcal{U}$ from the negative test of presence in $\overline{DB}$.
Moreover, Proposition 1 implies that for a genuine user, the probability not
to find an order $m$ hash chain following Eq. (2) in $\mathcal{U}$ is
approximately

$$P_{fr} (1 - P_{fa})^{N-1}$$

(i.e. that one finds neither any genuine equalities nor any impostor
equalities). Similarly, the probability for an impostor to find an order $m$
hash chain following Eq. (2) in $\mathcal{U}$ is

$$1 - (1 - P_{fa})^{N}.$$ □
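The effect of the database size $N$ on these two error rates can be checked numerically. The short Python sketch below evaluates both expressions, using as inputs the rates $P_{fr} \approx 0.066$ and $P_{fa} \approx 0.095$ quoted in Section 5 for $m = 4$; the function names are hypothetical.

```python
def false_not_member(p_fr, p_fa, n_templates):
    """A genuine user matches neither their own template nor any of the N-1 others."""
    return p_fr * (1 - p_fa) ** (n_templates - 1)

def false_member(p_fa, n_templates):
    """An impostor matches at least one of the N enrolled templates."""
    return 1 - (1 - p_fa) ** n_templates

P_FR, P_FA = 0.066, 0.095   # rates quoted in Section 5 for m = 4
for n in (1, 10, 100):
    print(n, false_not_member(P_FR, P_FA, n), false_member(P_FA, n))
```

For a single template the two rates reduce to $P_{fr}$ and $P_{fa}$, whereas for $N = 100$ the False 'Member' rate is very close to 1, which illustrates why a small $P_{fa}$ is needed in the authorization scenario.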

When using the prefix algorithm described in Table 1 or its randomized
variant, the overall size of a biometric negative database is given by the
following lemma.

Lemma 1 The size of an $(H, m, DB)$-biometric negative database
$\overline{DB}$ obtained through the prefix algorithm is at most

$$2 \times N \times \binom{L}{m} \times l^2$$

where $l$ is the length of the binary representation of an order $m$ hash
chain as in Eq. (2). Via its randomized variant, the upper size becomes

$$2 \times N \times \binom{L}{m} \times l^3.$$

Proof. $\overline{DB}$ is a negative database of
$\mathcal{U} = \bigcup_{k=1}^{N} DB_{b_k}$. Each $DB_{b_k}$ contains all
order $m$ hash chains obtained for the template $b_k$, i.e. one chain for
each of the $\binom{L}{m}$ choices of $m$ hash functions among the $L$ from
$H$. The rest is deduced from the expansion when the algorithms for negative
representation of Section 3 are applied. □

When the original database contains $N$ templates of length $n$, the
expansion factor is $2 \times \binom{L}{m} \times l^2 / n$ in the
deterministic case and $2 \times \binom{L}{m} \times l^3 / n$ in the
randomized case.

Note that although we focus here on the expansion when applying the prefix
algorithms, our concept is compatible with any algorithm for negative
representation generation.



4.2 Security Discussion 

Concerning the confidentiality of the original templates, the choice of an
adequate negative database creation algorithm leads to:

Proposition 3 Let $\overline{DB}$ be an $(H, m, DB)$-biometric negative
database generated through an algorithm which outputs hard instances for the
reconstruction problem (cf. Section 3).

The knowledge of $\overline{DB}$ does not allow one to retrieve the list of
original templates in $DB$.

This is straightforward since, by the above assumption, reconstructing
$\mathcal{U}$ is hard. Note that even without this assumption, reconstructing
the original templates might be a difficult task: if one succeeds in
retrieving part of the hash chains of $\mathcal{U}$ from $\overline{DB}$, one
still has no idea which hash chains are related to the same original
template. Nevertheless, one additional advantage of using hard instances, as
already mentioned in Section 3, is the possibility to create several
unlinkable randomized representations. This means that it is possible to have
different biometric negative databases from the same authorization list. For
instance, this fits well access control use cases where the local terminals
can hold the same authorization list without - thanks to our technique -
making a direct correlation between them possible.

Other advantages are implied by these negative representations and their
randomization: they hide the number of data items which are negatively
represented and, after insertion or deletion of data, applying randomization
to a negative database hides the size variation of the original dataset.

4.3 Operations 

On biometric negative databases as defined in Definition 3, we can apply all
the operations available for classical negative databases - see Section 3 -
such as randomization, insertion, deletion, clean-up and relational algebra
operations. Nevertheless, not all operations on $\overline{DB}$ have the same
impact with respect to the original database $DB$.

In particular, we are interested in the two following operations.

Enrollment of a new user, via a capture $b_{N+1}$, is straightforward using
the insertion functionality of the negative database for all the hash chains
computed from $b_{N+1}$.

Revocation of a user is less easy. Due to the structure of the biometric
negative database, only authorization checks are possible and there is no way
to link different hash chains together. So when using the deletion
functionality of negative databases with respect to a fresh capture $b'$
which has previously been determined as authorized (i.e. close to a $b_k$ for
some $k$), one can only delete the hash chains found for $b'$ (i.e. add to
$\overline{DB}$ the hash chains computed from $b'$). Thus this will not
suppress all the hash chains related to $b_k$. Moreover, if the prior
authorization corresponds to a False 'Member' Decision, then this revocation
would affect other users. This last point is a common problem for anonymous
membership checks in biometric systems. Concerning the former issue, we
suggest mitigating it by avoiding deletion from the database through
$\overline{DB}$ and instead adding a dedicated revocation database
(blacklist), which can be represented in a negative form.

4.4 Variant for Authentication 

Definition 3 leads to the construction of a biometric negative database for
authorization checks. When authentication (1-to-1) is the aim, the variant
below applies.

Definition 4 An $(H, m, DB)$-biometric negative database for authentication
$\overline{DB}_{auth}$ is the union of negative representations
$\overline{DB}_{b_k}$ of the datasets $DB_{b_k}$ ($k = 1 \ldots N$). More
precisely, we define $\overline{DB}_{auth}$ as the database containing, for
all $k$, the elements of $\overline{DB}_{b_k}$ with $k$ appended:

$$\overline{DB}_{auth} = \bigcup_{k=1}^{N} [\overline{DB}_{b_k}, k]$$

Appending $k$ makes it possible to restrict the verification to one sole
$\overline{DB}_{b_k}$. The associated algorithm for the authentication check,
with a fresh capture $b'$ and an identity claim $i$, is described in Table 3.
Note that this construction can be used as well for identification use cases
where no identity claim is provided, by running comparisons with all elements
of $\overline{DB}_{auth}$: the list of possible identities would be the list
of indices $k$ for which $b'$ is not detected as a member of
$\overline{DB}_{b_k}$.



Input: fresh biometric template $b'$ and identity claim $i$

 1. result $\leftarrow$ NOK
 2. For all $1 \le j_1 < \ldots < j_m \le L$
 3.     $z \leftarrow j_1 \| \ldots \| j_m \| h_{j_1}(b') \| \ldots \| h_{j_m}(b')$
 4.     If IsMember$_{\overline{DB}_{b_i}}(z)$ = NOK
 5.     Then
 6.         result $\leftarrow$ OK
 7.         Break
 8.     Else Continue
 9. End For
10. Output result

Table 3: Authentication Check for Biometric Negative Database

Similar to the results explained in Section 4.1, we have:



Lemma 2 Let $\overline{DB}_{auth}$ be an $(H, m, DB)$-biometric negative
database for authentication.

• The False Reject Rate of the authentication check (Table 3) is $P_{fr}$.

• The False Accept Rate of the authentication check (Table 3) is $P_{fa}$.

• When using the randomized variant of the prefix algorithm, the size of
$\overline{DB}_{auth}$ is

$$2 \times N \times \binom{L}{m} \times l^3.$$

Note that here, we give only the size with respect to the randomized variant
of the prefix algorithm. We cannot use the deterministic variant, as there is
no mixing of the different datasets $DB_{b_k}$ that would ensure
unlinkability of the hash chains.

Remark 1 It is interesting that the expansion factor in the authorization
scenario is lower than or equal to the expansion factor in the authentication
scenario.


5 An Example 

We now describe the application of our concept to the example of the
functions used in [16] for iris identification. The functions used by [16]
are $L = 128$ different restrictions of the 2048-bit iris vectors to 10 bits.

Let $\lambda_{min} = 0.25 \cdot 2048 = 512$ and
$\lambda_{max} = 0.35 \cdot 2048 = 716.8$ be the values of Condition 1. Then
the above family of functions is a
$(\lambda_{min}, \lambda_{max}, p_1, p_2)$-LSH family, where the probability
that two iriscodes coming from the same user agree on such a small portion is
$p_1 = (1 - \frac{512}{2048})^{10} \approx 0.056$ and the probability to
obtain the same value for two iriscodes coming from different users is
$p_2 = (1 - \frac{716.8}{2048})^{10} \approx 0.013$.

If we choose a hash chain order $m$ equal to 4, then the binary length of the
hash chains is $l = (7 + 10) \times 4 = 68$ and the error rates are
$P_{fr} \approx 0.066$ and $P_{fa} \approx 0.095$. This shows the interest of
using hash chains. This choice of parameters is suitable for the
authentication use case. It still needs to be improved to be efficient for
the authorization use case with medium-scale databases.

According to Lemma 1, the upper bound for the expansion factor will be about
$2^{25.5}$ when using the deterministic prefix algorithm and about $2^{31.6}$
with the randomized prefix algorithm.

As an even more practical example, if we take the order $m$ equal to 3 -
which is the threshold used in [16] for the determination of candidates in an
iris identification scenario - and $N = 100$ enrolled iriscodes from
different users, then this leads to an overall biometric negative database of
about 20 GB.
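These orders of magnitude can be re-derived with a short computation. The Python sketch below is only an illustration under stated assumptions: it reads the bounds of Lemma 1 as $2 \times N \times \binom{L}{m} \times l^2$ bits for the deterministic prefix algorithm and $2 \times N \times \binom{L}{m} \times l^3$ bits for the randomized one, with each hash function index encoded on 7 bits and each hash value on 10 bits. The function names are hypothetical.

```python
from math import comb, log2

def chain_length(m, index_bits=7, hash_bits=10):
    """Bit length l of an order-m hash chain: m indices plus m hash values."""
    return m * (index_bits + hash_bits)

def expansion_factor(L, m, n, exponent):
    """2 * C(L, m) * l**exponent / n, with exponent 2 (deterministic) or 3."""
    return 2 * comb(L, m) * chain_length(m) ** exponent / n

def negative_db_bytes(N, L, m, exponent):
    """Upper bound 2 * N * C(L, m) * l**exponent, converted from bits to bytes."""
    return 2 * N * comb(L, m) * chain_length(m) ** exponent / 8

print(log2(expansion_factor(128, 4, 2048, 2)))   # deterministic, m = 4
print(log2(expansion_factor(128, 4, 2048, 3)))   # randomized, m = 4
print(negative_db_bytes(100, 128, 3, 2) / 1e9)   # m = 3, N = 100, in GB
```

Under these assumptions, the deterministic bound for $m = 3$ and $N = 100$ comes out slightly above 20 GB.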



6 Conclusion 

This paper introduces the notion of negative database for biometric data.
While the concept of negative database was introduced in 2004, our proposal
is, as far as we know, the first one on this very subject. Although the
storage of biometric data seems a natural field of application for the
concept of negative database, one should understand that what makes our work
possible is the prior research on the quantization of biometrics. Indeed, our
contribution exploits the simpler matching algorithm that this quantization
permits. This is not the first time that these new matching algorithms have
produced results. For instance, beyond their own interest as an alternative
to traditional matching algorithms, they are currently considered as a part
of the solution for biometric identification in an encrypted way [1].

We believe that our scheme can still be optimized. For example, we directly
store in the positive database $DB$ the biometric data taken during the
enrollment phase, whereas we are interested in the whole Hamming ball of
radius $\lambda_{min}$. A clever representation of biometric data should lead
to more compact positive and negative databases.